This dataset covers the red variant of the Portuguese “Vinho Verde” wine. For more details, consult http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. The dataset was created using red wine samples. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
Importing the required libraries
library(dplyr)
library(ggplot2)
Loading the data
# Loading the Data
df_wine<- read.csv("wineQualityReds.csv")
head(df_wine) #Viewing the first few rows of data
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
str(df_wine) #Structure of the data set
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Attribute Information:
- fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
- volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
- citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines
- residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
- chlorides (sodium chloride - g / dm^3): the amount of salt in the wine
- free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
- density (g / cm^3): the density of wine is close to that of water, depending on the percent alcohol and sugar content
- pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
- sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
- alcohol (% by volume): the percent alcohol content of the wine
- quality (score between 0 and 10)
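Since total sulfur dioxide is described above as the sum of the free and bound forms, the bound portion can be derived by subtraction. A minimal sketch, using the six rows printed by head() earlier; the name `bound_so2` is introduced here only for illustration and is not a column of the dataset.

```r
# First six rows of the two SO2 columns, copied from the head() output above
free_so2  <- c(11, 25, 15, 17, 11, 13)   # free.sulfur.dioxide (mg / dm^3)
total_so2 <- c(34, 67, 54, 60, 34, 40)   # total.sulfur.dioxide (mg / dm^3)

# Bound SO2 is the part of the total that is not free (illustrative name)
bound_so2 <- total_so2 - free_so2

# Sanity check: total should never be smaller than free
stopifnot(all(bound_so2 >= 0))
print(bound_so2)
```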
In this section I conduct some preliminary exploration of the data, looking at each variable on its own.
summary(df_wine$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
ggplot(aes(fixed.acidity), data = df_wine)+
geom_histogram()+
xlab("Fixed Acidity")+
ylab("Count")+
ggtitle("Histogram of Fixed Acidity and Count")
summary(df_wine$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
ggplot(aes(volatile.acidity), data = df_wine)+
geom_histogram()+
scale_x_continuous(lim = c(0,1.4))+
xlab("Volatile Acidity")+
ylab("Count")+
ggtitle("Histogram of Volatile Acidity and Count")
summary(df_wine$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
ggplot(aes(citric.acid), data = df_wine)+
geom_histogram()+
scale_x_continuous(lim = c(0,0.80))+
xlab("Citric Acid")+
ylab("Count")+
ggtitle("Histogram of Citric Acid and Count")
summary(df_wine$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
ggplot(aes(residual.sugar), data = df_wine)+
geom_histogram()+
scale_x_continuous(lim = c(0,8))+
xlab("Residual Sugar")+
ylab("Count")+
ggtitle("Histogram of Residual Sugar and Count")
summary(df_wine$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
ggplot(aes(chlorides), data = df_wine)+
geom_histogram()+
scale_x_continuous(lim = c(0,0.3))+
xlab("Chlorides")+
ylab("Count")+
ggtitle("Histogram of Chlorides and Count")
summary(df_wine$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
ggplot(aes(free.sulfur.dioxide), data = df_wine)+
geom_histogram()+
xlab("Free Sulfur Dioxide")+
ylab("Count")+
ggtitle("Histogram of Free Sulfur Dioxide and Count")
summary(df_wine$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
ggplot(aes(total.sulfur.dioxide), data = df_wine)+
geom_histogram()+
scale_x_continuous(lim = c(0,175))+
xlab("Total Sulfur Dioxide")+
ylab("Count")+
ggtitle("Histogram of Total Sulfur Dioxide and Count")
summary(df_wine$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
ggplot(aes(density), data = df_wine)+
geom_histogram()+
xlab("Density")+
ylab("Count")+
ggtitle("Histogram of Density and Count")
summary(df_wine$pH) #Column names are case-sensitive: the column is pH, not ph
ggplot(aes(pH), data = df_wine)+
geom_histogram()+
xlab("pH")+
ylab("Count")+
ggtitle("Histogram of pH and Count")
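Column names in R are case-sensitive, and `$` returns NULL for a name that does not exist instead of raising an error, so a call such as `summary(df_wine$ph)` would report `Length 0` and class `NULL` rather than the pH statistics. A small sketch of this behaviour; `toy` is a made-up stand-in holding the first three pH values shown by head().

```r
# Toy data frame with the first three pH values from the head() output
toy <- data.frame(pH = c(3.51, 3.20, 3.26))

wrong <- toy$ph   # no such column: silently NULL
right <- toy$pH   # the actual column

is.null(wrong)    # the typo goes unnoticed here
summary(right)    # the usual six-number summary

# Matrix-style indexing fails loudly instead, which catches typos early
res <- tryCatch(toy[, "ph"], error = function(e) "undefined column")
res
```

Checking `names(df_wine)` before summarising is another simple way to catch this kind of slip.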
summary(df_wine$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
ggplot(aes(sulphates), data = df_wine)+
geom_histogram()+
scale_x_continuous(lim = c(0,1.5))+
xlab("Sulphates")+
ylab("Count")+
ggtitle("Histogram of Sulphates and Count")
summary(df_wine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
ggplot(aes(alcohol), data = df_wine)+
geom_histogram()+
xlab("Alcohol")+
ylab("Count")+
ggtitle("Histogram of Alcohol and Count")
summary(df_wine$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
ggplot(df_wine, aes(x=factor(quality))) + geom_bar() +
xlab("Quality")+
ylab("Count")+
ggtitle("Histogram of Quality and Count")
There are 1599 observations in the dataset and 12 variables (excluding X, which is only a row index).
My main interest in this dataset is to explore how quality is influenced by the other factors.
All features in this dataset may help support my investigation.
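The `X` column only repeats the row number, which is why it is left out of the variable count. One way to drop it before further analysis is sketched below; `sample_rows` is a hypothetical stand-in built from two columns of the rows shown by head(), so the snippet runs without the CSV file.

```r
# Stand-in for df_wine: X plus two columns from the head() output
sample_rows <- data.frame(
  X       = 1:6,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# X carries no information if it simply repeats the row number
x_is_index <- all(sample_rows$X == seq_len(nrow(sample_rows)))

# Keep every column except X
cleaned <- sample_rows[, setdiff(names(sample_rows), "X"), drop = FALSE]
names(cleaned)
```

Applied to `df_wine` right after loading, the same subsetting should leave the 12 measured variables.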
Creating a new variable named ratings for further analysis. The rating scale is as follows: bad (0-4), average (5-7), good (8-10).
#Converting the Quality from an Integer to a Factor
df_wine$quality <- factor(df_wine$quality, ordered = T)
#Creating a new Factored Variable called 'Ratings'
df_wine$ratings <- ifelse(df_wine$quality <= 4, 'bad', ifelse(
df_wine$quality <= 7, 'average', 'good'))
#Ordering
df_wine$ratings <- ordered(df_wine$ratings, levels = c('bad', 'average', 'good'))
summary(df_wine$ratings)
## bad average good
## 63 1518 18
ggplot(aes(x = ratings), data = df_wine)+
geom_bar()
We can observe that most of the wines fall in the average rating range.
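The same three-level rating can also be built with `cut()`, which maps numeric scores to labelled intervals in one call and avoids nested `ifelse()`. This is an alternative sketch, not the code used above; `scores` is a made-up vector covering the observed range of 3 to 8, and the approach assumes the quality scores are still numeric (that is, before the conversion to a factor).

```r
# Made-up scores spanning the observed quality range
scores <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed by default: (-Inf, 4], (4, 7], (7, Inf]
ratings <- cut(
  scores,
  breaks = c(-Inf, 4, 7, Inf),
  labels = c("bad", "average", "good"),
  ordered_result = TRUE
)

table(ratings)
```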
ggplot(aes(ratings, fixed.acidity), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Fixed Acidity") +
ggtitle("Fixed acidity v/s ratings")
ggplot(aes(quality, fixed.acidity), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Fixed Acidity") +
ggtitle("Fixed acidity v/s quality")
Observations: I do not see any pattern of fixed acidity affecting the quality of the wine. From these plots, fixed acidity does not appear to have much influence on wine quality.
ggplot(aes(ratings, volatile.acidity), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Volatile Acidity") +
ggtitle("Volatile acidity v/s ratings")
ggplot(aes(quality, volatile.acidity), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Volatile Acidity") +
ggtitle("Volatile acidity v/s quality")
Observations: The plots show that wines with lower volatile acidity tend to have higher quality. An ideal wine should therefore have low volatile acidity.
ggplot(aes(ratings, citric.acid), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Citric Acid") +
ggtitle("Citric Acid v/s ratings")
ggplot(aes(quality, citric.acid), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Citric Acid") +
ggtitle("Citric Acid v/s quality")
Observations: Citric acid appears to have a positive association with wine quality: the higher the citric acid concentration, the better the quality tends to be.
ggplot(aes(ratings, residual.sugar), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
scale_y_continuous(lim = c(0.5,4)) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Residual Sugar") +
ggtitle("Residual Sugar v/s ratings")
ggplot(aes(quality, residual.sugar), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
scale_y_continuous(lim = c(0.5,4)) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Residual Sugar") +
ggtitle("Residual Sugar v/s quality")
Observations: The impact of residual sugar on quality is not clear from the plots above, so no conclusion can be drawn. Note that the y-axis limits exclude outliers to make the plots easier to read.
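Note that setting limits on a ggplot2 scale does more than crop the picture: values outside the limits are turned into missing values before the boxplot statistics and the mean marker are computed, so those summaries describe only the remaining points. `coord_cartesian()` zooms the view while keeping all data in the calculation. The sketch below illustrates the difference in plain R using the first ten residual sugar values printed by str(); the variable names are illustrative.

```r
# First ten residual.sugar values, copied from the str() output above
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# What a scale limit of c(0.5, 4) effectively keeps
in_range <- sugar >= 0.5 & sugar <= 4

mean_all     <- mean(sugar)            # uses every value
mean_limited <- mean(sugar[in_range])  # the 6.1 reading is dropped

c(all = mean_all, limited = mean_limited)

# Zooming without dropping data (only runs if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(group = "sample", sugar = sugar),
                       ggplot2::aes(group, sugar)) +
    ggplot2::geom_boxplot() +
    ggplot2::coord_cartesian(ylim = c(0.5, 4))
}
```

The same consideration applies to every plot in this section that sets `lim` on the y scale.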
ggplot(aes(ratings, chlorides), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
scale_y_continuous(lim = c(0,.2)) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Chlorides") +
ggtitle("Chlorides v/s ratings")
ggplot(aes(quality, chlorides), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
scale_y_continuous(lim = c(0,.2)) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Chlorides") +
ggtitle("Chlorides v/s quality")
Observations: The impact of chlorides on quality is not fully clear from the plots above, but a lower chloride concentration may go with better wine quality. Note that the y-axis limits exclude some outliers to make the plots easier to read.
ggplot(aes(ratings, free.sulfur.dioxide), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
scale_y_continuous(lim = c(0,45)) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Free Sulfur Dioxide") +
ggtitle("Free Sulfur Dioxide v/s ratings")
ggplot(aes(quality, free.sulfur.dioxide), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .3) +
scale_y_continuous(lim = c(0,45)) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Free Sulfur Dioxide") +
ggtitle("Free Sulfur Dioxide v/s quality")
Observations: The impact of free sulfur dioxide on quality is not clear from the plots above, so no conclusion can be drawn. Note that the y-axis limits exclude outliers to make the plots easier to read.
ggplot(aes(ratings, total.sulfur.dioxide), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .2) +
scale_y_continuous(lim = c(0,150)) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Total Sulfur Dioxide") +
ggtitle("Total Sulfur Dioxide v/s ratings")
ggplot(aes(quality, total.sulfur.dioxide), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .2) +
scale_y_continuous(lim = c(0,150)) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Total Sulfur Dioxide") +
ggtitle("Total Sulfur Dioxide v/s quality")
Observations: Although the impact of total sulfur dioxide on quality is not clear, good-quality wines may have total sulfur dioxide above 30. No firm conclusion can be drawn, however, because the plots do not reveal any pattern. Note that the y-axis limits exclude outliers to make the plots easier to read.
ggplot(aes(quality, density), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .2) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Density") +
ggtitle("Density v/s quality")
Observations: The lower the density of the wine, the higher its quality tends to be.
ggplot(aes(ratings, pH), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .2) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("pH") +
ggtitle("pH v/s ratings")
ggplot(aes(quality, pH), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .2) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("pH") +
ggtitle("pH v/s quality")
Observations: The lower the pH, the higher the quality of the wine tends to be.
ggplot(aes(ratings, sulphates), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .2) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Sulphates") +
ggtitle("Sulphates v/s ratings")
ggplot(aes(quality, sulphates), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .2) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Sulphates") +
ggtitle("Sulphates v/s quality")
Observations: The higher the sulphate concentration, the higher the quality of the wine tends to be.
ggplot(aes(ratings, alcohol), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .2) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Alcohol") +
ggtitle("Alcohol v/s ratings")
ggplot(aes(quality, alcohol), data = df_wine) + geom_boxplot(alpha = .5) +
geom_jitter( alpha = .2) +
stat_summary(fun.y = "mean", geom = "point",color = "blue", shape = 8,size = 5)+
ylab("Alcohol") +
ggtitle("Alcohol v/s quality")
Observations: The higher the alcohol content, the higher the quality of the wine tends to be.
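The visual trends in this section can be backed with numbers, for example the mean of a variable within each quality score and a rank correlation. A minimal sketch on the first ten rows printed by str(), which is far too small a sample to draw conclusions from; on the real data the same calls would be applied to the `df_wine` columns, with `quality` converted back to a number rather than left as a factor.

```r
# First ten alcohol and quality values, copied from the str() output above
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Mean alcohol within each quality score
mean_by_quality <- tapply(alcohol, quality, mean)
mean_by_quality

# Spearman's rank correlation suits an ordinal score such as quality
rho <- cor(alcohol, quality, method = "spearman")
rho
```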
ggplot(aes(y = density, x = alcohol,color = quality),data = df_wine) +
geom_point() +
scale_color_brewer()+
theme_dark()+
xlab("Alcohol")+
ylab("Density")+
ggtitle("Alcohol and Density with respect to quality")
ggplot(aes(y = density, x = alcohol,color = quality),data = df_wine) +
facet_wrap(~ratings)+
geom_point() +
scale_color_brewer()+
theme_dark()+
xlab("Alcohol")+
ylab("Density")+
ggtitle("Alcohol and Density with respect to quality")
Observations: There is no clear evidence that density affects quality when alcohol is held constant.
ggplot(aes(y = sulphates, x = alcohol,color = quality), data = df_wine) +
geom_point() +
scale_y_continuous(limits=c(0.3,1.5)) +
scale_color_brewer()+
theme_dark()+
xlab("Alcohol")+
ylab("Sulphates")+
ggtitle("Alcohol and Sulphates with respect to quality")
ggplot(aes(y = sulphates, x = alcohol,color = quality), data = df_wine) +
geom_point() +
scale_y_continuous(limits=c(0.3,1.5)) +
facet_wrap(~ratings) +
scale_color_brewer()+
theme_dark()+
xlab("Alcohol")+
ylab("Sulphates")+
ggtitle("Alcohol and Sulphates with respect to quality")
Observations: It seems that wines with higher alcohol content also tend to contain more sulphates, and the combination is associated with higher quality.
ggplot(aes(y = pH, x = alcohol,color = quality),data = df_wine)+
geom_point() +
scale_color_brewer()+
theme_dark()+
xlab("Alcohol")+
ylab("pH")+
ggtitle("Alcohol and pH with respect to quality")
ggplot(aes(y = pH, x = alcohol,color = quality),data = df_wine)+
geom_point() +
facet_wrap(~ratings) +
scale_color_brewer()+
theme_dark()+
xlab("Alcohol")+
ylab("pH")+
ggtitle("Alcohol and pH with respect to quality")
Observations: Low pH combined with high alcohol content appears to go with better quality wine.
ggplot(aes(y = volatile.acidity, x = alcohol,color = quality),data = df_wine) +
geom_point() +
scale_color_brewer()+
theme_dark()+
xlab("Alcohol")+
ylab("Volatile Acidity")+
ggtitle("Alcohol and Volatile acidity with respect to quality")
ggplot(aes(y = volatile.acidity, x = alcohol,color = quality),data = df_wine) +
geom_point() +
facet_wrap(~ratings) +
scale_color_brewer()+
theme_dark()+
xlab("Alcohol")+
ylab("Volatile Acidity")+
ggtitle("Alcohol and Volatile acidity with respect to quality")
Observations: Lower volatile acidity combined with higher alcohol content appears to go with better quality wine.
Higher sulphate and higher alcohol content go with better quality wine.
Low pH and high alcohol content go with better quality wine.
Lower volatile acidity and higher alcohol content go with better quality wine.
No, as I am not yet comfortable with machine learning.
This dataset provides information on red “Vinho Verde” wine. I began by importing the required libraries and loading the data, which was in “.csv” format. I then carried out univariate analysis to understand each attribute in the dataset. Bivariate analysis revealed a lot about the data and helped me find insights into how the quality of wine is affected by various factors. I also wanted to know how two or more factors together affect quality, so I performed multivariate analysis, the results of which are explained above. Future work on this project could include building supervised learning models that help producers make quality wines.